Round 1 – SQL, DSA, and PySpark (Technical)
General Introduction
🔹 Discussed my previous tech stack and experiences.
Questions on:
🔹 Data volumes handled
🔹 Type of work done
🔹 Business impact of the work
🔹 SQL Questions
Identify continuous date ranges having the same status
Input:
Expected Output:
Approach: Use ROW_NUMBER() with grouping to find gaps and identify ranges.
Join Output Count
Given 2 tables with NULL values, determine output row count for different joins:
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL OUTER JOIN
🔹 DSA Coding Question
Group Anagrams
Input: ["eat", "tea", "tan", "ate", "nat", "bat"]
Output: [["bat"], ["tan", "nat"], ["eat", "tea", "ate"]]
Approach: Use a dictionary with sorted strings as keys to group anagrams.
PySpark
🔹 Union of Two Datasets with Different Schemas
🔹 Write a PySpark script to union two datasets having different:
Row counts
Column names
Schemas
Approach: Use column alignment and selectExpr or unionByName with allowMissingColumns=True.
Round 2 – PySpark and SQL
PySpark Tasks
🔹 Word Frequency in Text File
🔹 Read a text file using PySpark.
🔹 Return each word and its frequency.
🔹 Approach: Use flatMap, map, reduceByKey.
🔹 Spark Internals
🔹 Explain Spark architecture.
🔹 Jobs vs Stages vs Tasks.
Given code snippets
🔹 Identify narrow vs wide transformations.
🔹 Determine number of stages.
🔹 Employee Salary and Department Name
Task: Compute yearly salary for each employee and their department name.
SQL Tasks
🔹 Left Anti Join
🔹 Find number of records present in table 1 but not in table 2.
Data Modelling Task
🔹 Given: Customer, Product, and Orders tables.
🔹 Identify dimension vs fact tables.
🔹 Validate schema:
Fact PK should not be FK in a dim table
Round 3 – Spark and SQL
Spark Discussion
🔹 Spark architecture end-to-end
Spark memory management:
🔹 For small clusters, small data, and large data
Hardcoded reserved memory:
🔹 Can it be modified?
🔹 When should it be changed?
Spark 2 vs Spark 3:
🔹 AQE (Adaptive Query Execution)
🔹 Dynamic Partition Pruning
🔹 Pandas UDFs
Cloud Cost Optimization:
🔹 Storage
🔹 Compute
🔹 SQL Scenario
Problem:
From sales and refund tables, find customers who:
Cancelled an order (order_status = 'cancelled')
Did NOT get a refund
Placed another order after cancellation
Schema:
Sales Table:
| customer_id | order_id | order_plcd_ts | order_shipped_dt | order_amt | order_status |
Refund Table:
| order_id | order_refund_dt | refund_amt |
🔹 Git Question
git fetch vs git pull – conceptual and practical differences
Round 4 – Hiring Manager
Discussion Topics
🔹 Detailed discussion on past projects
🔹 CI/CD and pipeline automation
🔹 Airflow setup in AWS (MWAA)
🔹 Repartition vs Coalesce
🔹 When and why to use
🔹 Pros & cons
🔹 Role responsibilities and team expectations
✅ Received the Final Offer